Exploratory Data Analysis: White Wine Quality by Andrew Kinnaird

This report compares the physical characteristics and quality score of almost 5,000 white wine samples.

Univariate Plots Section

## [1] 4898   13
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
## [1] 0
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.factor      : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...

The data includes a quality score and 11 physical characteristics of 4,898 white wine samples. There are no “NA” values. The variable “X” was removed. While the allowed quality range is 0 - 10, the actual range of quality is 3 - 9.
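The cleaning steps described above can be sketched in base R. This is a minimal illustration rather than the report's original code: the toy data frame holds only a few columns of the six rows printed by head(), and the name na_count is an assumption.

```r
# Toy stand-in for the full data set: a few columns of the six head() rows.
white_wine <- data.frame(
  X              = 1:6,
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  quality        = c(6L, 6L, 6L, 6L, 6L, 6L)
)

# Count missing values across the whole data frame (na_count is a made-up name).
na_count <- sum(is.na(white_wine))

# Drop the row identifier, which is not used in the analysis.
white_wine$X <- NULL

# Ordered factor version of quality, one level per observed score (3 to 9).
white_wine$quality.factor <- factor(white_wine$quality,
                                    levels = 3:9, ordered = TRUE)

str(white_wine)
```

With levels 3 through 9, a score of 6 is stored as the fourth level, which matches the "4 4 4 ..." codes in the str() output above.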

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
##       x freq
## 1 FALSE 4878
## 2  TRUE   20
##       x freq
## 1 FALSE 4893
## 2  TRUE    5

The quality scores appear to be normally distributed from 3 - 9, with about 500 more scores of 5 than of 7 and 15 more scores of 3 than of 9 (20 versus 5).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Both fixed.acidity and volatile.acidity are right skewed, each with extreme outliers. Do fixed.acidity and volatile.acidity correlate?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The variable citric.acid appears to be normally distributed with extreme outliers slightly below 1.25 and 1.75. Also, there is a peak near 0.5. Why is there a peak at 0.5?

## [1] "1.2"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The most common residual sugar values are near 1 gram/liter, with the mode being 1.2. Because of the right skew in the histogram, the data was transformed using a log10 scale. This transformation revealed a bimodal distribution.
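One way to compute a mode and apply a log10 transformation in base R is sketched below. The helper get_mode is hypothetical; the report's own helper is not shown, although a table()-based approach would return a character string like the "1.2" printed above. The input is only the first ten residual.sugar values from the str() output, so the result differs from the full-sample mode.

```r
# First ten residual.sugar values, as printed by str() above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Mode as the most frequent value. table() names are character strings,
# and ties are broken in favor of the smallest value.
get_mode <- function(x) names(which.max(table(x)))

sugar_mode <- get_mode(residual_sugar)

# A log10 scale compresses the long right tail:
# values of 1, 10, and 100 become 0, 1, and 2.
log_sugar <- log10(residual_sugar)
```

On a log10 scale, the two peaks described above (just above 1 and just below 10 grams/liter) fall near 0 and 1, which is why the bimodal shape becomes visible.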

According to the information provided with the data, “it’s rare to find wines with less than 1 gram/liter [of sugar] and wines with greater than 45 grams/liter [of sugar] are considered sweet.”

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

With outliers removed, the chloride distribution appears normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The variable “free.sulfur.dioxide” is a subset of “total.sulfur.dioxide,” therefore total.sulfur.dioxide’s mean and median are higher than free.sulfur.dioxide’s. According to the information provided with the data, “at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.” As SO2 levels increase and become evident, does the quality of the wine decrease?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Almost all density values are within a range of .005. With outliers removed, a slightly right skewed distribution is revealed. The distribution is skewed toward 1.000, the density of water (https://water.usgs.gov/edu/density.html).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH values are normally distributed. pH values increase in acidity as they approach 0 (https://water.usgs.gov/edu/ph.html). According to the information provided with the data, “most wines are between 3-4 on the pH scale.”

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## [1] "0.5"

According to the information provided with the data, sulphates are “a wine additive which can contribute to sulfur dioxide gas (S02) levels, wich (sic) acts as an antimicrobial and antioxidant.” Is this additive added in a standard amount, leading to the mode of 0.5?

## [1] "9.4"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol levels are skewed right with a mode of 9.4. In order to address the skew in the histogram, the alcohol data was transformed using a log10 transformation. Is there a correlation between increased alcohol and increased quality?

##  int [1:77] 6 4 6 6 4 6 5 7 5 5 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.377   6.000   8.000

As stated earlier, “it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.” This subset examines the quality of wines with less than 1 gram/liter of sugar. This subset of 77 white wines skews toward lower quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6

As stated earlier, “it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet.” This subset examines the quality of wines with more than 45 grams/liter of sugar. There is only one wine with more than 45 grams/liter of residual.sugar. Its quality score of 6 is above the entire sample’s mean of 5.878.

##  int [1:1229] 6 6 5 6 5 5 5 6 6 6 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.739   6.000   9.000

This subset examines the quality of wines with at least 9.9 grams/liter of sugar, which coincides with the third quartile of the entire white_wine sample. Similar to white wines with low amounts of sugar, this subset skews toward lower quality.

The mean quality for the entire sample of white wine is 5.878. The mean quality for wines with residual sugar of less than 1 is 5.377. The mean quality for wines with residual sugar greater than or equal to 45 is 6, but since there is only one wine with “high” residual sugar this result does not provide significant insight. The mean quality for wines with residual sugar greater than or equal to 9.9, the third quartile of the entire sample, is 5.739.
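The subset comparisons above follow a simple pattern: filter on a residual sugar threshold, then summarize quality. A minimal sketch is below. The data frame wines and its values are hypothetical and serve only to show the mechanics; they are not drawn from the data set.

```r
# Hypothetical wines; values are illustrative, not taken from the data set.
wines <- data.frame(
  residual.sugar = c(0.8, 0.9, 1.6, 6.9, 12.0, 20.7),
  quality        = c(5, 4, 6, 6, 5, 6)
)

# Quality scores for wines below 1 g/L and at or above the 9.9 g/L cutoff.
low_sugar  <- subset(wines, residual.sugar < 1)$quality
high_sugar <- subset(wines, residual.sugar >= 9.9)$quality

# Compare each subset's mean quality with the overall mean.
mean_all  <- mean(wines$quality)
mean_low  <- mean(low_sugar)
mean_high <- mean(high_sugar)
```

Calling summary() on low_sugar or high_sugar would produce the same six-number layout shown in the output above.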

##  int [1:1256] 5 7 7 8 8 6 7 5 6 7 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   6.000   6.000   6.423   7.000   9.000

##  int [1:2555] 6 6 6 6 6 6 6 6 5 5 ...
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.791   6.000   9.000

The mean quality for the entire sample of white wine is 5.878. The mean quality for wines with alcohol of greater than 11.4 is 6.423. The mean quality for wines with alcohol content within the interquartile range, i.e. 9.50 and 11.40, is 5.791.
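Quartile-based subsets like these can be built from quantile() rather than hard-coded cutoffs. The sketch below uses only the first ten alcohol values from the str() output, so its quartiles differ from the full-sample values of 9.50 and 11.40. The variable names are assumptions.

```r
# First ten alcohol values, as printed by str() above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11)

# First and third quartiles (R's default quantile type 7).
q <- unname(quantile(alcohol, probs = c(0.25, 0.75)))

# Values inside the interquartile range (inclusive) and above the third quartile.
within_iqr <- alcohol[alcohol >= q[1] & alcohol <= q[2]]
above_q3   <- alcohol[alcohol > q[2]]
```

Whether a boundary is inclusive matters when many wines sit exactly on a quartile, because a strict ">" and a ">=" can then select noticeably different numbers of rows.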

Univariate Analysis

What is the structure of your dataset?

There are 4,898 white wines in the dataset with 12 features (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). Further information, including units, is below:

  • fixed acidity (tartaric acid - g / dm^3)
  • volatile acidity (acetic acid - g / dm^3)
  • citric acid (g / dm^3)
  • residual sugar (g / dm^3)
  • chlorides (sodium chloride - g / dm^3)
  • free sulfur dioxide (mg / dm^3)
  • total sulfur dioxide (mg / dm^3)
  • density (g / cm^3)
  • pH
  • sulphates (potassium sulphate - g / dm^3)
  • alcohol (% by volume)

Output variable (based on sensory data):

  • quality (score between 0 and 10)

Other observations:

The data set contains no ‘NA’ values

Quality is normally distributed. The range of quality is 3-9.

Most other variables are also normally distributed, with two notable exceptions: residual sugar and alcohol.

Residual sugar’s range is from 0.600 to 65.800 with a mean of 6.391 and mode of 1.2, which created a right skewed histogram.

Alcohol’s range is from 8 to 14.20 with a mean of 10.51 and mode of 9.4, which created a right skewed histogram.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality. The analysis will seek to determine which features highly correlate with the quality of a white wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

While fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol will likely all impact the quality of white wine, I suspect alcohol and residual sugar will impact the quality of white wine more than the other variables.

Did you create any new variables from existing variables in the dataset?

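One new variable was created: quality.factor, an ordered factor version of quality with 7 levels (3 through 9). No other new variables were created.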

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The variable “X” was removed because the unique identifiers will not be utilized.

A log10 transformation was used on the right skewed residual.sugar and alcohol distributions. The resulting transformation of residual sugar appears bimodal with sugar content peaking just above 1 and just below 10. The resulting transformation of alcohol appears normal.

In several plots the x-axis was limited in order to remove outliers and establish a better understanding of the bulk of the data. Also, bin widths were adjusted to align with the significant figures for each variable.
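The effect of limiting the x-axis and fixing a bin width can be sketched with base R's hist(). The report's actual plotting code is not shown, so the cutoff upper_limit and the bin_width below are hypothetical. In practice the bin width would be chosen to match each variable's precision.

```r
# First ten residual.sugar values, as printed by str() above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

upper_limit <- 20  # hypothetical cutoff that excludes the largest values
bin_width   <- 1   # hypothetical bin width

# Keep only values at or below the cutoff, then bin them without drawing.
trimmed <- residual_sugar[residual_sugar <= upper_limit]
h <- hist(trimmed, breaks = seq(0, upper_limit, by = bin_width), plot = FALSE)

h$counts  # number of wines per bin
```

Setting plot = FALSE returns the bin counts without opening a graphics device, which makes it easy to check how many observations a cutoff removes.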

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                1.000           -0.023       0.289
## volatile.acidity            -0.023            1.000      -0.149
## citric.acid                  0.289           -0.149       1.000
## residual.sugar               0.089            0.064       0.094
## chlorides                    0.023            0.071       0.114
## free.sulfur.dioxide         -0.049           -0.097       0.094
## total.sulfur.dioxide         0.091            0.089       0.121
## density                      0.265            0.027       0.150
## pH                          -0.426           -0.032      -0.164
## sulphates                   -0.017           -0.036       0.062
## alcohol                     -0.121            0.068      -0.076
## quality                     -0.114           -0.195      -0.009
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.089     0.023              -0.049
## volatile.acidity              0.064     0.071              -0.097
## citric.acid                   0.094     0.114               0.094
## residual.sugar                1.000     0.089               0.299
## chlorides                     0.089     1.000               0.101
## free.sulfur.dioxide           0.299     0.101               1.000
## total.sulfur.dioxide          0.401     0.199               0.616
## density                       0.839     0.257               0.294
## pH                           -0.194    -0.090              -0.001
## sulphates                    -0.027     0.017               0.059
## alcohol                      -0.451    -0.360              -0.250
## quality                      -0.098    -0.210               0.008
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                       0.091   0.265 -0.426    -0.017  -0.121
## volatile.acidity                    0.089   0.027 -0.032    -0.036   0.068
## citric.acid                         0.121   0.150 -0.164     0.062  -0.076
## residual.sugar                      0.401   0.839 -0.194    -0.027  -0.451
## chlorides                           0.199   0.257 -0.090     0.017  -0.360
## free.sulfur.dioxide                 0.616   0.294 -0.001     0.059  -0.250
## total.sulfur.dioxide                1.000   0.530  0.002     0.135  -0.449
## density                             0.530   1.000 -0.094     0.074  -0.780
## pH                                  0.002  -0.094  1.000     0.156   0.121
## sulphates                           0.135   0.074  0.156     1.000  -0.017
## alcohol                            -0.449  -0.780  0.121    -0.017   1.000
## quality                            -0.175  -0.307  0.099     0.054   0.436
##                      quality
## fixed.acidity         -0.114
## volatile.acidity      -0.195
## citric.acid           -0.009
## residual.sugar        -0.098
## chlorides             -0.210
## free.sulfur.dioxide    0.008
## total.sulfur.dioxide  -0.175
## density               -0.307
## pH                     0.099
## sulphates              0.054
## alcohol                0.436
## quality                1.000
##                        [,1]
## fixed.acidity        -0.114
## volatile.acidity     -0.195
## citric.acid          -0.009
## residual.sugar       -0.098
## chlorides            -0.210
## free.sulfur.dioxide   0.008
## total.sulfur.dioxide -0.175
## density              -0.307
## pH                    0.099
## sulphates             0.054
## alcohol               0.436
## quality               1.000

The correlation between all variables and ‘quality’.
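A one-column correlation table like the one above can be produced by passing a data frame and a single vector to cor(). The sketch below is an assumption about the approach, not the report's code, and the data frame wines holds hypothetical values.

```r
# Hypothetical wines; values are illustrative, not taken from the data set.
wines <- data.frame(
  density = c(1.0010, 0.9970, 0.9951, 0.9940, 0.9915, 0.9900),
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.2, 12.8),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Pearson correlation of every column with quality, as a one-column matrix.
cor_with_quality <- round(cor(wines, wines$quality), 3)
print(cor_with_quality)
```

Because the second argument is an unnamed vector, the resulting column has no name and prints under the header [,1], as in the output above.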

##                        [,1]
## fixed.acidity         0.023
## volatile.acidity      0.071
## citric.acid           0.114
## residual.sugar        0.089
## chlorides             1.000
## free.sulfur.dioxide   0.101
## total.sulfur.dioxide  0.199
## density               0.257
## pH                   -0.090
## sulphates             0.017
## alcohol              -0.360
## quality              -0.210

The correlation between all variables and ‘chlorides’.

##                        [,1]
## fixed.acidity         0.265
## volatile.acidity      0.027
## citric.acid           0.150
## residual.sugar        0.839
## chlorides             0.257
## free.sulfur.dioxide   0.294
## total.sulfur.dioxide  0.530
## density               1.000
## pH                   -0.094
## sulphates             0.074
## alcohol              -0.780
## quality              -0.307

The correlation between all variables and ‘density’.

##                        [,1]
## fixed.acidity        -0.121
## volatile.acidity      0.068
## citric.acid          -0.076
## residual.sugar       -0.451
## chlorides            -0.360
## free.sulfur.dioxide  -0.250
## total.sulfur.dioxide -0.449
## density              -0.780
## pH                    0.121
## sulphates            -0.017
## alcohol               1.000
## quality               0.436

The correlation between all variables and ‘alcohol’.

# http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
library(psych)  # provides pairs.panels()
pairs.panels(subset(white_wine, select=-c(quality.factor)))

# http://www.sthda.com/english/wiki/scatter-plot-matrices-r-base-graphs
pairs.panels(subset(white_wine, select=c(residual.sugar, chlorides, 
                                         total.sulfur.dioxide, density, alcohol,
                                         quality)))

Quality is the variable of focus. Therefore, correlations with quality are highlighted. The strongest correlations exist between quality and the following:

  • chlorides (-0.210)
  • density (-0.307)
  • alcohol (0.436)

Because alcohol has the strongest correlation to quality, the relationship between alcohol and all other variables is considered. The strongest correlations exist between alcohol and the following:

  • density (-0.780)
  • residual.sugar (-0.451)
  • total sulfur dioxide (-0.449)

Between all variables, the strongest correlations exist between the following:

  • density and residual sugar (0.839)
  • density and alcohol (-0.780)
  • density and total sulfur dioxide (0.530)
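The ranking of variable pairs can be reproduced by keeping each pair once and sorting by absolute correlation. The sketch below is one possible approach, not the report's code. It uses a four-variable excerpt of the correlation matrix printed above, and the names r and ranked are assumptions.

```r
# Excerpt of the correlation matrix printed above (four variables only).
vars <- c("residual.sugar", "total.sulfur.dioxide", "density", "alcohol")
r <- matrix(c(
   1.000,  0.401,  0.839, -0.451,
   0.401,  1.000,  0.530, -0.449,
   0.839,  0.530,  1.000, -0.780,
  -0.451, -0.449, -0.780,  1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep only the upper triangle so each pair appears once and the
# diagonal (self-correlation of 1) is dropped.
r[!upper.tri(r)] <- NA

# Reshape to one row per pair and sort by absolute correlation.
ranked <- as.data.frame(as.table(r), stringsAsFactors = FALSE)
ranked <- ranked[!is.na(ranked$Freq), ]
ranked <- ranked[order(-abs(ranked$Freq)), ]
print(ranked)
```

Sorting on the absolute value keeps strong negative correlations, such as density and alcohol, near the top of the list.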

As quality increases, the concentration of chlorides decreases.

According to the boxplot above, no inference can be made about the correlation between quality and density.

Overall, quality increases as alcohol concentration increases. There is a decrease in alcohol concentration as quality increases from 3 to 5.
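A trend read from boxplots can be checked numerically by computing the median of a variable for each quality score. The sketch below uses hypothetical values chosen to mimic the pattern described above (a dip between quality 3 and 5, then a rise). It is not the report's data.

```r
# Hypothetical wines; values are illustrative, not taken from the data set.
wines <- data.frame(
  quality = c(3, 3, 5, 5, 7, 7),
  alcohol = c(10.4, 10.6, 9.4, 9.6, 11.4, 11.6)
)

# Median alcohol for each quality score, returned as a named vector.
median_alcohol <- tapply(wines$alcohol, wines$quality, median)
print(median_alcohol)
```

The same call with chlorides or density in place of alcohol would summarize the other boxplots in this section.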

As the concentration of alcohol increases, density decreases.

As the concentration of alcohol increases, residual sugar decreases especially from an alcohol concentration of 8-10%.

As the concentration of alcohol increases, total sulfur dioxide decreases.

As density increases, residual sugar increases.

As density increases, alcohol concentration decreases.

As density increases, total sulfur dioxide increases.

##      residual.sugar density alcohol total.sulfur.dioxide
## 2782           65.8 1.03898    11.7                  160

In the plots above, there is an outlier at a density just below 1.04. This is a result of the high concentration of residual sugar in that wine.
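An outlier like this can be located with which.max(). The sketch below is an assumption about the lookup, not the report's code. The toy data frame combines three of the head() rows with the values of row 2782 printed above.

```r
# Three head() rows plus row 2782, using the values printed above.
white_wine <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 65.8),
  density              = c(1.0010, 0.9940, 0.9951, 1.03898),
  alcohol              = c(8.8, 9.5, 10.1, 11.7),
  total.sulfur.dioxide = c(170, 132, 97, 160),
  row.names            = c("1", "2", "3", "2782")
)

# Select the row with the highest density.
outlier <- white_wine[which.max(white_wine$density), ]
print(outlier)
```

In this toy data, the densest wine is also the one with the most residual sugar, consistent with the explanation above.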

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features
in the dataset?

Quality correlates most strongly with chlorides, density, and alcohol.

As quality increases, the concentration of chlorides decreases.

As quality increases, density decreases.

As quality increases, the concentration of alcohol generally increases. The exception is the quality range from 3 to 5, where alcohol concentration decreases as quality increases.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Because alcohol has the strongest correlation to quality, the relationship between alcohol and all other variables was considered. The strongest correlations exist between alcohol and the following: density, residual sugar, total sulfur dioxide.

As the concentration of alcohol increases, density decreases. “Alcohol, or ethanol, is the intoxicating agent found in beer, wine and liquor.” https://www.drugs.com/alcohol.html The density of ethanol is 0.7893 (https://pubchem.ncbi.nlm.nih.gov/compound/ethanol#section=Density). Therefore, as the concentration of alcohol increases, the density of the wine decreases.
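The direction of this effect can be illustrated with an idealized mixing calculation. The sketch below assumes wine is only ethanol and water and that volumes add linearly. Real wine contains sugar and other solutes, and ethanol-water mixtures contract slightly, so the numbers are illustrative only. The function name mix_density is hypothetical.

```r
ethanol_density <- 0.7893  # g/cm^3, the value cited above
water_density   <- 1.00    # g/cm^3

# Idealized density of an ethanol-water mixture, weighted by volume fraction.
# Assumes volumes add linearly and ignores sugar and other solutes.
mix_density <- function(alcohol_fraction) {
  alcohol_fraction * ethanol_density + (1 - alcohol_fraction) * water_density
}

# Approximate densities at 8%, 11%, and 14% alcohol by volume.
mix_density(c(0.08, 0.11, 0.14))
```

Because ethanol is less dense than water, each additional percentage point of alcohol lowers the idealized mixture density by about 0.002 g/cm^3.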

As the concentration of alcohol increases, residual sugar decreases, especially from an alcohol concentration of 8-10%. This is a by-product of alcohol production. “[W]hen winemaking happens, yeast eats sugar and makes ethanol (alcohol) as a by-product.” (https://winefolly.com/review/sugar-in-wine-chart/) The sugar left over after this process is called “residual sugar.” (https://bit.ly/2K2nFv1) Therefore, as the amount of sugar “eaten” by yeast increases, the percentage of alcohol increases and the amount of residual sugar in the wine decreases simultaneously.

As the concentration of alcohol increases, total sulfur dioxide decreases. “Sulfur dioxide (SO2) is important in the winemaking process as it aids in preventing microbial growth[…].” https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472855/ “[Alcohol] acts synergistically and enhances the bacteria-killing effect of molecular SO2 [sulfur dioxide], so high-alcohol wines require less SO2 protection” (https://www.extension.purdue.edu/extmedia/fs/fs-52-w.pdf). Therefore, as alcohol concentration increases, the amount of total sulfur dioxide decreases.

What was the strongest relationship you found?

The strongest correlations between any variables were considered. All three involved density, paired with each of the following: residual sugar, alcohol, and total sulfur dioxide.

As density increases, residual sugar increases. This correlation occurs because “[t]he more sugar that’s mixed into a measured amount of water, the higher the density of the mixture.” (https://www.stevespanglerscience.com/lab/experiments/sugar-rainbow/)

As density increases, alcohol concentration decreases. “Alcohol, or ethanol, is the intoxicating agent found in beer, wine and liquor.” (https://www.drugs.com/alcohol.html) The density of ethanol is 0.7893 (https://pubchem.ncbi.nlm.nih.gov/compound/ethanol#section=Density), which is lower than the density of water. Therefore, as the concentration of alcohol increases, the density of the wine decreases.

As density increases, total sulfur dioxide increases. The density of sulfur dioxide is 1.434. (https://pubchem.ncbi.nlm.nih.gov/compound/sulfur_dioxide#section=Density) Therefore as the concentration of sulfur dioxide increases, the density of wine increases.

Multivariate Plots Section

As density decreases and alcohol increases, the quality scores appear to increase.

As residual sugar decreases and alcohol increases, the quality scores appear to
increase.

As total sulfur dioxide decreases and alcohol increases, quality scores appear to increase.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

The relationship between density, alcohol, and quality appears to be the strongest. The plot of residual sugar and alcohol suggested that quality increased as residual sugar decreased and alcohol increased, but the relationship does not appear to be strong. A similar result was shown by the plot of total sulfur dioxide and alcohol.

Were there any interesting or surprising interactions between features?

Because density and alcohol were the variables most closely correlated with quality, and were also closely correlated with one another, the strength of the relationship between density, alcohol, and quality was not surprising.


Final Plots and Summary

Plot One

print(mean(white_wine$quality))
## [1] 5.877909

Description One

The plot “Quality of Wine” contains the quality score data for 4,898 wines. Wine quality is the feature of focus for this study. While the allowed range of scores is 0-10, only scores of 3-9 were given. The mean quality score is 5.878.

Plot Two

Description Two

Plot two highlights the strongest correlation between two variables: 0.84 between residual sugar and density. This relationship is explained by the density of sugar and the contents of wine. “A wine typically contains ethanol (~13%) [and] water (85%)… .” (https://bit.ly/2MXqAXh) The density of water is 1.00. (https://water.usgs.gov/edu/density.html) Because wine is 85% water, the densities of the wines in the sample are near 1.00. An increase of residual sugar (density of 1.56) increases the density of the wine (https://bit.ly/2KNqlkd).

This correlation led to further research showing residual sugar, total sulfur dioxide, and alcohol are dependent upon one another. The amount of alcohol in a wine depends upon the amount of sugar “eaten” by yeast (https://winefolly.com/review/sugar-in-wine-chart/): the more sugar eaten, the more alcohol in the wine and the less residual sugar remains. (https://bit.ly/2K2nFv1) Then as the concentration of alcohol increases, total sulfur dioxide decreases, because less sulfur dioxide needs to be added to prevent microbial growth. (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3472855/) Therefore, a wine with less residual sugar will have more alcohol and a wine with more alcohol will have less total sulfur dioxide, creating a less dense wine.

Plot Three

Description Three

Plot three highlights the relationship between alcohol, density, and quality. As described in plot two, density and alcohol correlate with one another. These factors also appear to correlate with quality: as density decreases and alcohol increases (and, as described in plot two, residual sugar and total sulfur dioxide also decrease), the quality scores increase.

Reflection

The analysis performed examined a sample of white wine which included 4,898 observations of 13 variables. The variable of focus was the wine’s quality score. Initial analysis sought to understand the relationship between quality and all other variables. Unfortunately, there wasn’t a single variable which strongly correlated with quality. Surprisingly, this analysis revealed a stronger relationship between density and three other variables: residual sugar, total sulfur dioxide, and alcohol. Further research revealed the dependent nature of the relationship of these variables. Quality was then plotted against alcohol and density, which showed that quality does correlate with an increase in alcohol concentration and a decrease in density. Further research should seek to isolate the dependent factors within the dataset in order to control for amounts of residual sugar and total sulfur dioxide.